Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In Advances in Neural Information Processing Systems, MIT Press.

Author

  • Richard S. Sutton
Abstract

On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse coarse-coded function approximators (CMACs), whereas they used mostly global function approximators, and that we learned online, whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes ("rollouts"), as in classical Monte Carlo methods and as in the TD(λ) algorithm when λ = 1. However, in our experiments this always resulted in substantially poorer performance. We conclude that reinforcement learning can work robustly in conjunction with function approximators, and that there is little justification at present for avoiding the case of general λ.

1  Reinforcement Learning and Function Approximation

Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh; Sutton; Watkins). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in practice, the estimates never become exact; on large problems, parameterized function approximators such as neural networks must be used. Because the estimates are imperfect, and because they in turn are used as the targets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed, some such methods have been shown to be unstable in theory (Baird; Gordon; Tsitsiklis & Van Roy) and in practice (Boyan & Moore). On the other hand, other methods have been proven stable in theory (Sutton; Dayan) and very effective in practice (Lin; Tesauro; Zhang & Dietterich; Crites & Barto). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narrowing the answer to this question.

The reinforcement learning methods we use are variations of the sarsa algorithm (Rummery & Niranjan; Singh & Sutton). This method is the same as the TD(λ) algorithm (Sutton), except that it is applied to state-action pairs instead of states, and the predictions are used as the basis for selecting actions. The learning agent estimates action values Qπ(s, a), defined as the expected future reward starting in state s, taking action a, and thereafter following policy π. These are estimated for all states and actions, and for the policy currently being followed by the agent. The policy is chosen dependent on the current estimates in such a way that they jointly improve, ideally approaching an optimal policy and the optimal action values. In our experiments, actions were selected according to what we call the ε-greedy policy. Most of the time, the action selected in state s was the action for which the estimate Q(s, a) was the largest, with ties broken randomly. However, a small fraction, ε, of the time, the action was instead selected randomly and uniformly from the action set, which was
always discrete and finite. There are two variations of the sarsa algorithm, one using conventional accumulate traces and one using replace traces (Singh & Sutton). This and other details of the algorithm we used are given in Figure 1.

To apply the sarsa algorithm to tasks with a continuous state space, we combined it with a sparse coarse-coded function approximator known as the CMAC (Albus; Miller, Gordon & Kraft; Watkins; Lin & Kim; Dean et al.; Tham). A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place (see Figure 2). The overall effect is much like a network with fixed radial basis functions, except that it is particularly efficient computationally; in other respects one would expect RBF networks and similar methods (see Sutton & Whitehead) to work just as well. It is important to note that the tilings need not be simple grids. For example, to avoid the curse of dimensionality, a common trick is to ignore some dimensions in some tilings, i.e., to use hyperplanar slices instead of boxes. A second major trick is hashing: a consistent random collapsing of a large set of tiles into a much smaller set. Through hashing, memory requirements are often reduced by large factors with little loss of performance. This is possible because high resolution is needed in only a small fraction of the state space. Hashing frees us from the curse of dimensionality in the sense that memory requirements need not be exponential in the number of dimensions, but need merely match the real demands of the task.

2  Good Convergence on Control Problems

We applied the sarsa and CMAC combination to the three continuous-state control problems studied by Boyan and Moore: 2D gridworld, puddle world, and mountain car. Whereas they used a model of the task dynamics and applied dynamic programming backups offline to a fixed set of states, we learned online, without a model, and backed up whatever states were encountered during complete trials. Unlike Boyan …

1. Initially: w_a(f) := Q_0/c, e_a(f) := 0, for all a in Actions and all f in CMAC-tiles.
2. Start of trial: s := random-state; F := features(s); a := ε-greedy-policy(F).
3. Eligibility traces: e_b(f) := λ e_b(f), for all b, f.
   3a. Accumulate algorithm: e_a(f) := e_a(f) + 1, for all f in F.
   3b. Replace algorithm: e_a(f) := 1, e_b(f) := 0, for all f in F and all b ≠ a.
4. Environment step: take action a; observe resultant reward r and next state s'.
5. Choose next action: F' := features(s'), unless s' is the terminal state, in which case F' := ∅; a' := ε-greedy-policy(F').
6. Learn: w_b(f) := w_b(f) + (α/c) [ r + Σ_{f'∈F'} w_{a'}(f') − Σ_{f∈F} w_a(f) ] e_b(f), for all b, f.
7. Loop: a := a'; s := s'; F := F'; if s' is the terminal state, go to 2; else go to 3.

Figure 1: The sarsa algorithm for finite-horizon (trial-based) tasks. The function ε-greedy-policy(F) returns, with probability ε, a random action or, with probability 1−ε, computes Σ_{f∈F} w_a(f) for each action a and returns the action for which the sum is largest, resolving any ties randomly. The function features(s) returns the set of CMAC tiles corresponding to the state s. The number of tiles returned is the constant c. Q_0, α, and λ are scalar parameters.

[Figure 2: multiple overlapping grid tilings (Tiling #1, ...) laid over a two-dimensional state space, with axes Dimension #1 and Dimension #2.]
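The following is a minimal sketch, in Python, of the tile-coding idea described above: several grid tilings offset from one another, with hashing used to collapse the full set of tiles into a fixed-size table. The class and parameter names (TileCoder, n_tilings, tiles_per_dim, hash_size) are illustrative assumptions, not from the paper, and the state is assumed to be scaled into the unit hypercube.

```python
import numpy as np


class TileCoder:
    """Maps a continuous state to one active tile index per tiling.

    The active indices from all tilings together form the sparse binary
    feature representation described above; the optional hashing step
    collapses the large set of exact tiles into `hash_size` slots.
    """

    def __init__(self, dims, n_tilings=8, tiles_per_dim=8, hash_size=4096, seed=0):
        self.dims = dims
        self.n_tilings = n_tilings
        self.tiles_per_dim = tiles_per_dim
        self.hash_size = hash_size
        rng = np.random.default_rng(seed)
        # Fixed random offsets (in tile widths) so the tilings overlap but do not coincide.
        self.offsets = rng.random((n_tilings, dims))
        # Fixed per-tiling salt so the hashing is consistent across calls.
        self._salt = rng.integers(1 << 31, size=n_tilings)

    def features(self, state):
        """Return c = n_tilings hashed tile indices for a state in [0, 1]^dims."""
        state = np.asarray(state, dtype=float)
        active = []
        for t in range(self.n_tilings):
            # Grid cell of tiling t that contains the state.
            coords = np.floor(state * self.tiles_per_dim + self.offsets[t]).astype(int)
            # Consistent random collapsing ("hashing") into a smaller table.
            key = hash((t, int(self._salt[t]), tuple(coords.tolist())))
            active.append(key % self.hash_size)
        return active
```

For example, coder = TileCoder(dims=2) followed by coder.features([0.3, 0.7]) would return eight tile indices, one per tiling; nearby states share most of their tiles, which is what produces the coarse-coded generalization.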
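Building on that tile coder, here is a minimal sketch of the sarsa(λ) loop of Figure 1, in the replace-traces variant with ε-greedy action selection. The environment interface (env.reset() returning a state, env.step(a) returning next state, reward, and a done flag), the function name, and all parameter values are assumptions for illustration, not taken from the paper; the update is undiscounted, matching the trial-based tasks in Figure 1.

```python
import numpy as np


def sarsa_lambda(env, coder, n_actions, n_trials=100,
                 alpha=0.5, lam=0.9, epsilon=0.1, q0=0.0, seed=0):
    """Sketch of sarsa(lambda) with CMAC features and replacing traces."""
    rng = np.random.default_rng(seed)
    c = coder.n_tilings                                  # tiles active per state
    w = np.full((n_actions, coder.hash_size), q0 / c)    # weights, w_a(f) := Q_0 / c
    e = np.zeros_like(w)                                 # eligibility traces e_a(f)

    def q(F, a):
        # Q(s, a) is the sum of the active weights for action a.
        return w[a, F].sum()

    def eps_greedy(F):
        # With probability epsilon pick uniformly; otherwise pick a greedy action,
        # breaking ties randomly.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        values = np.array([q(F, b) for b in range(n_actions)])
        best = np.flatnonzero(values == values.max())
        return int(rng.choice(best))

    for _ in range(n_trials):
        e[:] = 0.0
        s = env.reset()                                  # start of trial
        F = coder.features(s)
        a = eps_greedy(F)
        done = False
        while not done:
            # Replace traces: decay all traces, clear other actions' traces on the
            # active tiles, then set the current (a, f) traces to 1.
            e *= lam
            e[:, F] = 0.0
            e[a, F] = 1.0
            s, r, done = env.step(a)                     # environment step (assumed API)
            if done:
                target = r                               # terminal: no bootstrap term
            else:
                F2 = coder.features(s)
                a2 = eps_greedy(F2)
                target = r + q(F2, a2)                   # undiscounted sarsa target
            w += (alpha / c) * (target - q(F, a)) * e    # update all weights via traces
            if not done:
                F, a = F2, a2
    return w
```

Because only c tiles are active in any state, each step touches a handful of weights per action, which is the computational advantage of the sparse coarse coding over global function approximators.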
